Tags → #neural networks
-
Understanding AdaNorm
Understanding adaptive layer normalization (adaLN), as popularized by the DiT paper.
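As a teaser for the post, a minimal numpy sketch of the idea: normalize the activations, then modulate them with a per-sample scale and shift regressed from a conditioning vector. The weight names (`W_scale`, `W_shift`) are hypothetical, not from any particular implementation.

```python
import numpy as np

def adaptive_layer_norm(x, cond, W_scale, W_shift, eps=1e-5):
    # Standard layer norm over the feature axis (no learned gamma/beta here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    # Scale and shift are produced from the conditioning vector,
    # so the normalization adapts per sample (hypothetical weight names).
    scale = cond @ W_scale
    shift = cond @ W_shift
    return x_norm * (1 + scale) + shift
```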
-
Understanding Squared attention
A brief explanation of how the attention mechanism works, and why its cost scales quadratically with sequence length.
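A quick sketch of where the quadratic cost comes from: scaled dot-product attention forms an n-by-n score matrix, so both memory and compute grow as O(n²) in sequence length. This is a minimal single-head version, not any specific library's implementation.

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    # The score matrix is (n, n): every query attends to every key,
    # which is the source of the quadratic scaling.
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```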
-
Decoder Transformer
How I understand the decoder-only Transformer used in generative text models.
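The defining piece of a decoder-style Transformer is causal masking: position i may only attend to positions up to i, which is what lets the model generate text left to right. A minimal sketch under that assumption (single head, no learned projections):

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask: True where attention is allowed.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Disallowed (future) positions get -inf, so softmax zeroes them out.
    scores = np.where(causal_mask(len(Q)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Note that the first position can only attend to itself, so its output is exactly its own value vector.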